Sentiment Analysis on Text¶

Introduction¶

Sentiment analysis is a central aspect of NLP, from predicting stock quotes to understanding twitter tweets, we use sentiment analysis to understand the data in our lives. Sentiment analysis is a crucial tool in exploring the language around us in a Machine Learning context; by looking at words and there respective semantics we are able to analyze and decipher large datasets that could never be read entirely by hand. A business can use Sentiment Analysis to study brand awareness or trending opinions.

We used a couple of different approaches to the data, after first processing the data, and analyzing the data through charts and graphs; we looked into the relevant types of Machine Learning algorithms.

Overview¶

After cleaning up the text data by getting rid of characters, stop words, and making consitent casing; the text was presentable for data processing. We then added features which we analized through charts and graphs.

Columns¶

Sentiment: a score of 'positive', 'negative', and 'neutral'

Text: The various tweets we are studying

Unnamed: 2 This one seems superfulous with only one or two values.

Problem Statement and Objective¶

Our objective is to create a semi-supervised machine learning model that has the ability to categorize tweets by sentiment. To achieve this, we will be working with a dataset that has been collected. The goal of the model is to predict the sentiment of various tweets.

About the Dataset¶

The dataset is collected from tweets that have pre-applied sentiment analysis for each tweet. dataset here

Solution and Analysis¶

Prepare the tools¶

Here we import relevant packages.

In [ ]:
import numpy as np
import pandas as pd
import nltk
import plotly.io as pio
import plotly.graph_objects as go
pio.templates.default = "plotly_white"
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import plotly.express as px
import re
import string
from nltk.stem import WordNetLemmatizer
string.punctuation
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns

Setup plotly¶

This function is required for plotly to run properly in colab without it sometimes the plots don't render properly.

To avoid code repeatability, enable plotly in cell function is created

In [ ]:
def enable_plotly_in_cell():
  import IPython
  from plotly.offline import init_notebook_mode
  display(IPython.core.display.HTML('''<script src="/static/components/requirejs/require.js"></script>'''))
  init_notebook_mode(connected=False)

Load Data¶

In [ ]:
df = pd.read_csv ('dataset_semi.csv')
df
Out[ ]:
Text Sentiment Unnamed: 2
0 Such an easy app to use, really quick and easy... positive NaN
1 The drivers and the services have been excepti... positive NaN
2 All rides have been enjoyable. positive NaN
3 Driver very knew where I was neutral NaN
4 most driver's are child friendly and patient. positive NaN
... ... ... ...
5917 My Liked Songs can only display all my songs i... neutral NaN
5918 Although it can be a little annoying in the fr... negative NaN
5919 It isn't about the catalogue..it's about the c... positive NaN
5920 Except for the fact that I can't open my downl... negative NaN
5921 This app stinks too many interruptions and upg... negative NaN

5922 rows × 3 columns

Our datafile has our Text, and Sentiment, and a out of place column; unnamed: 2. And a complete set of observations. Reletively clean

Understanding the data¶

Descriptive Analysis

In [ ]:
df.shape
Out[ ]:
(5922, 3)
In [ ]:
print(f'We see the dataset has {df.shape[0]} observations and over {df.shape[1]} features.')
We see the dataset has 5922 observations and over 3 features.

Review the data and sample data

In [ ]:
df.head()
Out[ ]:
Text Sentiment Unnamed: 2
0 Such an easy app to use, really quick and easy... positive NaN
1 The drivers and the services have been excepti... positive NaN
2 All rides have been enjoyable. positive NaN
3 Driver very knew where I was neutral NaN
4 most driver's are child friendly and patient. positive NaN
In [ ]:
df.sample(5)
Out[ ]:
Text Sentiment Unnamed: 2
1341 Its a good app for online study and mettings . positive NaN
1875 An extraordinary accuracy via gps , a must if ... positive NaN
4830 This is a nice app but many more features are ... positive NaN
3328 Very well succeeded, i Love it! positive NaN
2944 No latency or poor connection s negative NaN

Check datatype and information regarding all different features¶

We want to look at the datatypes and check to see if they were interpreted correctly.

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5922 entries, 0 to 5921
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Text        5922 non-null   object
 1   Sentiment   5922 non-null   object
 2   Unnamed: 2  1 non-null      object
dtypes: object(3)
memory usage: 138.9+ KB

Learning the data mathematically

In [ ]:
df.describe(include = 'all')
Out[ ]:
Text Sentiment Unnamed: 2
count 5922 5922 1
unique 5871 5 1
top My music is only a search away! negative positive
freq 3 2186 1

Data Pre-processing¶

From the information we reviewed we could see there are no duplicates. However, we can check again and handle it if required.

Look for missing data

In [ ]:
df.isnull().sum()
Out[ ]:
Text             0
Sentiment        0
Unnamed: 2    5921
dtype: int64

No Null values, in the Text and Sentiment columns

Check for duplicated records

In [ ]:
print(f'There are {df.duplicated().sum()} duplicated rows in the dataset.')
There are 43 duplicated rows in the dataset.
In [ ]:
df[df['Text'].duplicated() == True]
Out[ ]:
Text Sentiment Unnamed: 2
151 Poor sync. negative NaN
951 working on it neutral NaN
1599 I don't see any reason to install! negative NaN
1877 now after this update i can't dismiss ads.. negative NaN
1884 The add songs feature in playlists doesnt let ... negative NaN
1889 slightly disappointed in you need to buy premi... negative NaN
1907 Too many ads, same ads repeated over and over ... negative NaN
1913 The app have to pay to choose your song of yo... neutral NaN
1930 Bring back my memories with old song with good... neutral NaN
1933 Whenever I go to play a track my app just cras... negative NaN
1996 GPs fail to show properly, drivers are mistake... negative NaN
2021 I'm about to stop using it period! negative NaN
2034 very bad app negative NaN
2043 But the update on my iPhone has been hanging f... negative NaN
2045 Habs already restarted does not bring anything negative NaN
2056 Love the new app! The design is super, and it'... positive NaN
2078 Chat heads don't appear sometimes but over all... negative NaN
2101 An example of what is possible without meeting... negative NaN
2353 Wont change my review until your CS contact me... neutral NaN
3546 Please make this game compatible with iPhone 6... neutral NaN
3554 No respond from them yet on that they are doin... neutral NaN
3677 From the functioncni app, totalne stilted and ... positive NaN
3834 Photos are sent in horrendous quality, with ch... negative NaN
3932 Premium is nice because it's affordable. positive NaN
4356 How in the hell does this app not have dark mo... neutral NaN
4364 Facebook changed a lot through years. neutral NaN
4374 Which incest developer has made such nonsense,... negative NaN
4383 After this last update, when I put my e-mail a... neutral NaN
4384 Unfortunately, despite almost weekly updates o... negative NaN
4427 One more reason to consider when deciding why ... neutral NaN
4430 What will it take to remove this feature? neutral NaN
4434 I've been using the app for several years now. neutral NaN
4528 Help, unwrite me off the paid version! Please! negative NaN
4910 I watch all Yt, etc., only through here to not... negative NaN
5323 I am 19 and why can't i sing up neutral NaN
5358 My music is only a search away! neutral NaN
5399 Wish the app will allow you to add another rou... positive NaN
5454 Your estimated time of arrival is always not c... negative NaN
5459 I call him, message him, he doesn't pick up th... negative NaN
5526 My music is only a search away! neutral NaN
5528 My songs doesn't play automatically.I already ... negative NaN
5529 This used to be a wonderful App. positive NaN
5530 Lately it has been very slow and unresponsive. negative NaN
5531 It barely functions for me and many other user... negative NaN
5532 So unfortunately at this time I would not reco... negative NaN
5533 Very nice app enjoy it very much positive NaN
5534 Spotify has their own DC Universe access deals... neutral NaN
5535 It's official, I hate Spotify multiple people ... negative NaN
5536 Very disappointing to have drivers pick a requ... negative NaN
5538 I just luv their customer service positive NaN
5581 This update doesn t allow me to see my homepag... negative NaN

Double check if the values are actually duplicate and make sure the sentiment is not different

In [ ]:
df.loc[df['Text'] == 'Poor sync.']
Out[ ]:
Text Sentiment Unnamed: 2
110 Poor sync. negative NaN
151 Poor sync. negative NaN
In [ ]:
df.loc[df['Text'] == 'very bad app']
Out[ ]:
Text Sentiment Unnamed: 2
981 very bad app negative NaN
2034 very bad app negative NaN

Drop all duplicates

In [ ]:
df.drop_duplicates(keep=False, inplace=True)
In [ ]:
df
Out[ ]:
Text Sentiment Unnamed: 2
0 Such an easy app to use, really quick and easy... positive NaN
1 The drivers and the services have been excepti... positive NaN
2 All rides have been enjoyable. positive NaN
3 Driver very knew where I was neutral NaN
4 most driver's are child friendly and patient. positive NaN
... ... ... ...
5917 My Liked Songs can only display all my songs i... neutral NaN
5918 Although it can be a little annoying in the fr... negative NaN
5919 It isn't about the catalogue..it's about the c... positive NaN
5920 Except for the fact that I can't open my downl... negative NaN
5921 This app stinks too many interruptions and upg... negative NaN

5837 rows × 3 columns

Check duplicates again

In [ ]:
df[df['Text'].duplicated() == True]
Out[ ]:
Text Sentiment Unnamed: 2
1599 I don't see any reason to install! negative NaN
2021 I'm about to stop using it period! negative NaN
2043 But the update on my iPhone has been hanging f... negative NaN
2045 Habs already restarted does not bring anything negative NaN
2101 An example of what is possible without meeting... negative NaN
4374 Which incest developer has made such nonsense,... negative NaN
4383 After this last update, when I put my e-mail a... neutral NaN
4427 One more reason to consider when deciding why ... neutral NaN

We stil have duplicates

Let's review these duplicates

In [ ]:
df.loc[df['Text'] == 'But the update on my iPhone has been hanging for hours.']
Out[ ]:
Text Sentiment Unnamed: 2
1974 But the update on my iPhone has been hanging f... neutral NaN
2043 But the update on my iPhone has been hanging f... negative NaN
In [ ]:
df.loc[df['Text'] == 'I\'m about to stop using it period!']
Out[ ]:
Text Sentiment Unnamed: 2
2010 I'm about to stop using it period! neutral NaN
2021 I'm about to stop using it period! negative NaN

For the reamining duplicates we can see that the sentiment was captured differently. We have two same texts with different sentiments. We will treat this by updating them to one sentiment.

In [ ]:
df.loc[df['Text'] == 'I\'m about to stop using it period!', "Sentiment"] = "negative"
df.loc[df['Text'] == 'I don\'t see any reason to install!', "Sentiment"] = "negative"
df.loc[df['Text'] == 'I\'m about to stop using it period!', "Sentiment"] = "negative"
df.loc[df['Text'] == 'But the update on my iPhone has been hanging for hours.', "Sentiment"] = "negative"
df.loc[df['Text'] == 'Habs already restarted does not bring anything', "Sentiment"] = "negative"
df.loc[df['Text'] == 'An example of what is possible without meeting on lots of lagg and curl ', "Sentiment"] = "negative"
df.loc[df['Text'] == 'Which incest developer has made such nonsense, heard executed or shared with the horses in 5 ', "Sentiment"] = "negative"
df.loc[df['Text'] == 'After this last update, when I put my e-mail and my password, from the error...', "Sentiment"] = "negative"
df.loc[df['Text'] == 'One more reason to consider when deciding why I should even keep this', "Sentiment"] = "negative"
In [ ]:
df.drop_duplicates(keep=False, inplace=True)
In [ ]:
df[df['Text'].duplicated() == True]
Out[ ]:
Text Sentiment Unnamed: 2

We have treated all the duplicates

We created a new list of all the reviews and cleaned out punctuation to make case consistent

Setting up the file to remove punctuation

In [ ]:
reviews = []
for index, row in df.iterrows():
    reviews.append(row['Text'])

Punctuation removal

In [ ]:
punc = '''@1234567890!()-[]{};:'"\,<>./?@#$%^&*_~'''
reviewscleaned = []
for review in reviews:
    no_punct = ""
    for char in review:
       if char not in punc:
           no_punct = no_punct + char
    reviewscleaned.append(no_punct)
In [ ]:
print(reviewscleaned)

Importing stop words, we found we needed to add spanish stop words to the lexicon because many tweets weren't in english

In [ ]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
sw_nltk = stopwords.words('english')
sw_nltk1 = stopwords.words('spanish')
newreviews = []
for text in reviewscleaned:
    words = [word for word in text.split() if word.lower() not in (sw_nltk or sw_nltk1)]
    new_text = " ".join(words)
    newreviews.append(new_text)
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

Checking word

In [ ]:
print(newreviews)

We created a cleaned dataframe called df2

In [ ]:
df2 = pd.DataFrame()
df2["Sentiment"] = df["Sentiment"]
df2["Text"] = newreviews
df2
Out[ ]:
Sentiment Text
0 positive easy app use really quick easy set absolutely ...
1 positive drivers services exceptional since ever
2 positive rides enjoyable
3 neutral Driver knew
4 positive drivers child friendly patient
... ... ...
5917 neutral Liked Songs display songs sort recently added
5918 negative Although little annoying free version WAY bett...
5919 positive isnt catalogueits curation Spotify
5920 negative Except fact cant open downloaded albums Im Off...
5921 negative app stinks many interruptions upgrades good do...

5821 rows × 2 columns

In [ ]:
sentiment = []
for index, row in df.iterrows():
    sentiment.append(row['Sentiment'])

We apply the stop words to the corpus

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
final_stopwords_list = stopwords.words('english') + stopwords.words('spanish')
input_vector = TfidfVectorizer (max_features=3000, min_df=6, max_df=0.8, stop_words=final_stopwords_list)
newreviews = input_vector.fit_transform(newreviews).toarray()

Here we found some odd results in the Sentiments column of the data

In [ ]:
df2['Sentiment'].value_counts()
Out[ ]:
negative                                                                           2134
positive                                                                           2120
neutral                                                                            1565
 it adds a lot of great options by opening doors to new places and experiences.       1
-                                                                                     1
Name: Sentiment, dtype: int64

We see some non ascii characters in the dataset. This is bad data we will take care of it

In [ ]:
def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7F]', ' ', text)
In [ ]:
df2['Sentiment'] = df2['Sentiment'].apply(remove_non_ascii)

Capturing the correct set of elements

In [ ]:
df_filtered = df2.loc[df2['Sentiment'].isin(['positive', 'neutral', 'negative'])]
In [ ]:
df_filtered['Sentiment'].value_counts()
Out[ ]:
negative    2134
positive    2120
neutral     1565
Name: Sentiment, dtype: int64

This is where we created the datafile for most of the work

In [ ]:
df3 = df_filtered
In [ ]:
df3.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5819 entries, 0 to 5921
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Sentiment  5819 non-null   object
 1   Text       5819 non-null   object
dtypes: object(2)
memory usage: 136.4+ KB
In [ ]:
df3['Sentiment'].isnull().sum
Out[ ]:
<bound method NDFrame._add_numeric_operations.<locals>.sum of 0       False
1       False
2       False
3       False
4       False
        ...  
5917    False
5918    False
5919    False
5920    False
5921    False
Name: Sentiment, Length: 5819, dtype: bool>

We added character length and word length

Adding features character length and word length

In [ ]:
df3['Character_Length'] = df['Text'].str.len()
df3['Word_Length'] = df['Text'].str.count(' ') + 1

Checking value counts for character lengths

In [ ]:
df3['Character_Length'].value_counts()
Out[ ]:
38      100
34       99
46       97
40       94
35       92
       ... 
2123      1
378       1
476       1
469       1
278       1
Name: Character_Length, Length: 353, dtype: int64
In [ ]:
df3['Word_Length'].value_counts()
Out[ ]:
6      591
7      499
8      461
9      394
11     363
      ... 
100      1
68       1
149      1
101      1
71       1
Name: Word_Length, Length: 92, dtype: int64
In [ ]:
df3.sample()
Out[ ]:
Sentiment Text Character_Length Word_Length
268 neutral Particularly widget hand home screen ability a... 145 24
In [ ]:
df3.head()
Out[ ]:
Sentiment Text Character_Length Word_Length
0 positive easy app use really quick easy set absolutely ... 101 20
1 positive drivers services exceptional since ever 61 10
2 positive rides enjoyable 30 5
3 neutral Driver knew 28 6
4 positive drivers child friendly patient 45 7

Exploratory Data Analysis¶

A sentiment pie chart

In [ ]:
enable_plotly_in_cell()

sentiment = df3['Sentiment'].value_counts()

fig = px.pie(sentiment,
             values = sentiment.values,
             names = sentiment.index,
             color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textinfo='percent+label',
                  marker = dict(line = dict(color = 'white', width = 5)))
fig.show()

WordCloud for top words

In [ ]:
text = ' '.join(df3['Text'])
wordcloud = WordCloud(background_color="white").generate(text)
plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [ ]:
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
In [ ]:
enable_plotly_in_cell()
common_words = get_top_n_words(df['Text'], 20)


df_unigram_20 = pd.DataFrame(common_words, columns = ['Word' , 'count']).sort_values(by="count",ascending=False).reset_index(drop=True)

fig = px.bar(df_unigram_20, x='Word', y='count')
fig.update_layout(
    title={
        'text': "Top 20 words across all cases",
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'})
fig.show()
In [ ]:
def avg_word(sentence):
    words = sentence.split()
    if len(words) > 0:
      return (sum(len(word) for word in words)/len(words))
    return 0
In [ ]:
def avg_word_length(df):
    df['avg_word'] = df['Text'].apply(lambda x: avg_word(x))
    print(df[['Text','avg_word']].head())
In [ ]:
avg_word_length(df3)
                                                Text  avg_word
0  easy app use really quick easy set absolutely ...      5.30
1            drivers services exceptional since ever      7.00
2                                    rides enjoyable      7.00
3                                        Driver knew      5.00
4                     drivers child friendly patient      6.75
In [ ]:
df3
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word
0 positive easy app use really quick easy set absolutely ... 101 20 5.300000
1 positive drivers services exceptional since ever 61 10 7.000000
2 positive rides enjoyable 30 5 7.000000
3 neutral Driver knew 28 6 5.000000
4 positive drivers child friendly patient 45 7 6.750000
... ... ... ... ... ...
5917 neutral Liked Songs display songs sort recently added 75 15 5.571429
5918 negative Although little annoying free version WAY bett... 87 17 6.125000
5919 positive isnt catalogueits curation Spotify 72 11 7.750000
5920 negative Except fact cant open downloaded albums Im Off... 84 16 5.222222
5921 negative app stinks many interruptions upgrades good do... 111 17 6.230769

5819 rows × 5 columns

In [ ]:
def hash_tags(df):
    df['hashtags'] = df['Text'].apply(lambda x: len([x for x in x.split() if x.startswith('#')]))
    print(df[['Text','hashtags']].head())
In [ ]:
hash_tags(df3)
                                                Text  hashtags
0  easy app use really quick easy set absolutely ...         0
1            drivers services exceptional since ever         0
2                                    rides enjoyable         0
3                                        Driver knew         0
4                     drivers child friendly patient         0

Modelling¶

Use the latest model for Modelling

In [ ]:
newreviews2 = input_vector.transform(df3['Text']).toarray()
In [ ]:
sentiment2 = []
for index, row in df3.iterrows():
    sentiment2.append(row['Sentiment'])
In [ ]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(newreviews2,sentiment2,train_size=0.8)
In [ ]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
rf_classifier.fit(X_train, y_train)
rf_classifier_score = rf_classifier.score(X_train, y_train)
rf_classifier_score
Out[ ]:
0.9935553168635876
In [ ]:
from sklearn.svm import SVC
svc_classifier = SVC(kernel='linear')
svc_classifier.fit(X_train, y_train)
svc_classifier_score = svc_classifier.score(X_train, y_train)
svc_classifier_score
Out[ ]:
0.8977443609022556
In [ ]:
import sklearn.linear_model as sk
lr_classifier = sk.LogisticRegression(random_state=0, solver='liblinear', multi_class='ovr').fit(X_train, y_train)
lr_classifier_score = lr_classifier.score(X_train, y_train)
lr_classifier_score
Out[ ]:
0.8850698174006445
In [ ]:
from sklearn.metrics import accuracy_score
rf_test = rf_classifier.predict(X_test)
accuracy_scorerf = accuracy_score(y_test, rf_test)
print(accuracy_scorerf)
0.7551546391752577
In [ ]:
svc_test = svc_classifier.predict(X_test)
accuracy_scoresvc = accuracy_score(y_test, svc_test)
print(accuracy_scoresvc)
0.7792096219931272
In [ ]:
lr_test = lr_classifier.predict(X_test)
accuracy_scorelr = accuracy_score(y_test,lr_test)
print(accuracy_scorelr)
0.7783505154639175
In [ ]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
In [ ]:
lr_confusion_matrix = confusion_matrix(y_test, lr_test)
In [ ]:
sns.heatmap(lr_confusion_matrix, annot=True, fmt='g');  #annot=True to annotate cells, ftm='g' to disable scientific notation
In [ ]:
svc_confusion_matrix = confusion_matrix(y_test, svc_test)
In [ ]:
rf_confusion_matrix = confusion_matrix(y_test, rf_test)
In [ ]:
lr_test
Out[ ]:
array(['negative', 'positive', 'negative', ..., 'negative', 'negative',
       'positive'], dtype='<U8')

Reviewing Confusion Matrix based on Each Model¶

Plot bar graph with each model

In [ ]:
enable_plotly_in_cell()
accuracy_of_models = {'SVC': svc_classifier_score,
                      'Random Forest': rf_classifier_score,
                      'Logistic Regression': lr_classifier_score}


fig = px.bar(x = list(accuracy_of_models.keys()), y= list(accuracy_of_models.values()),
             color = list(accuracy_of_models.values()),
             width = 800, height = 400,
             color_discrete_sequence=px.colors.qualitative.G10,
             labels={'x':'Classifier', 'y':'Accuracy'}, text_auto=True)


fig.update_layout(title='Accuracy performance of classification models')
fig.show()

Compare confusion matrix for each model

In [ ]:
confusion_matrix_of_models = {'SVC': svc_confusion_matrix,
                      'Random Forest': rf_confusion_matrix,
                      'Logistic Regression': lr_confusion_matrix}
In [ ]:
enable_plotly_in_cell()
scores = [accuracy_scorerf, accuracy_scoresvc, accuracy_scorelr]

best_score = max(scores)
best_model = ''

for key, value in accuracy_of_models.items():
    best_model = key
    fig = px.imshow(confusion_matrix_of_models[key], text_auto=True, aspect="auto",
                    color_continuous_scale='viridis',
                    labels=dict(x="Actual", y="Prediction"))

    fig.update_layout(title = f'{key} Matrix', height=500, width=800)
    fig.show()

    confusion_matrix_of_models.pop(key)
In [ ]:
best_score
Out[ ]:
0.7792096219931272
In [ ]:
from IPython.display import Markdown
In [ ]:
Markdown(f"""
#### From the results above we can see that {best_model} perfoms best with the highest accuracy of {round(best_score * 100, 2)}%""")
Out[ ]:

From the results above we can see that Logistic Regression perfoms best with the highest accuracy of 77.92%¶

In [ ]:
df3
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word hashtags
0 positive easy app use really quick easy set absolutely ... 101 20 5.300000 0
1 positive drivers services exceptional since ever 61 10 7.000000 0
2 positive rides enjoyable 30 5 7.000000 0
3 neutral Driver knew 28 6 5.000000 0
4 positive drivers child friendly patient 45 7 6.750000 0
... ... ... ... ... ... ...
5917 neutral Liked Songs display songs sort recently added 75 15 5.571429 0
5918 negative Although little annoying free version WAY bett... 87 17 6.125000 0
5919 positive isnt catalogueits curation Spotify 72 11 7.750000 0
5920 negative Except fact cant open downloaded albums Im Off... 84 16 5.222222 0
5921 negative app stinks many interruptions upgrades good do... 111 17 6.230769 0

5819 rows × 6 columns

We added two entries that were removed from the dataset.

In [ ]:
new_entry = {'Text': "None", 'Sentiment': 'negative', "Unnamed: 2": np.NaN,'Character_Length': np.NaN, 'Word_Length': np.NaN}
df3.loc[578] = new_entry
df3.loc[1837] = new_entry

Here we create a radom arrangement of None values in the sentiment category so as to create a semi-labaled dataset for a semi-supervised model to train on. Sean wrote this with the help of open ai's generative text processor.

In [ ]:
percent_sampled = 27
sample_df = df3.sample(frac = percent_sampled/100)
In [ ]:
sample_df
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word hashtags
4457 negative worst appit many time says please check connec... 59.0 10.0 5.375000 0.0
3759 neutral real time exchange rates currently one option ... 92.0 17.0 5.545455 0.0
1060 positive Gud useful online classes 39.0 7.0 5.500000 0.0
1925 negative like stuff please ask language preferences Ind... 124.0 22.0 6.090909 0.0
4972 positive oBsuzjsiznOAmsy visbss see mm es la PayPal hotel 57.0 11.0 5.125000 0.0
... ... ... ... ... ... ...
5473 negative Theres reason streaming app take GB 56.0 12.0 5.000000 0.0
5484 neutral Ive tried every option Gmail using phone number 60.0 11.0 5.000000 0.0
4883 negative never get notifications pagesfrustrating mobil... 153.0 26.0 7.000000 0.0
1957 neutral used possible choose nickname set rambling let... 130.0 25.0 6.000000 0.0
4268 negative post posted anything bad 49.0 12.0 5.250000 0.0

1572 rows × 6 columns

In [ ]:
df3.loc[df3.Text.isin(sample_df.Text), "Sentiment"] = "None"
In [ ]:
df3
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word hashtags
0 positive easy app use really quick easy set absolutely ... 101.0 20.0 5.300000 0.0
1 positive drivers services exceptional since ever 61.0 10.0 7.000000 0.0
2 positive rides enjoyable 30.0 5.0 7.000000 0.0
3 None Driver knew 28.0 6.0 5.000000 0.0
4 positive drivers child friendly patient 45.0 7.0 6.750000 0.0
... ... ... ... ... ... ...
5919 positive isnt catalogueits curation Spotify 72.0 11.0 7.750000 0.0
5920 negative Except fact cant open downloaded albums Im Off... 84.0 16.0 5.222222 0.0
5921 negative app stinks many interruptions upgrades good do... 111.0 17.0 6.230769 0.0
578 negative None NaN NaN NaN NaN
1837 negative None NaN NaN NaN NaN

5821 rows × 6 columns

In [ ]:
df3.head(25)
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word hashtags
0 positive easy app use really quick easy set absolutely ... 101.0 20.0 5.300000 0.0
1 positive drivers services exceptional since ever 61.0 10.0 7.000000 0.0
2 positive rides enjoyable 30.0 5.0 7.000000 0.0
3 None Driver knew 28.0 6.0 5.000000 0.0
4 positive drivers child friendly patient 45.0 7.0 6.750000 0.0
5 None Quick easy use drivers quite friendly 😊 54.0 11.0 4.714286 0.0
6 None Love Appits easy☮️shows person drive u Name 88.0 17.0 5.285714 0.0
7 None Best drivers ever 17.0 3.0 5.000000 0.0
8 positive Good travel app 24.0 5.0 4.333333 0.0
9 positive Cabs r clean drivers 26.0 6.0 4.250000 0.0
10 positive Love rides 14.0 3.0 4.500000 0.0
11 positive Fast affordable efficient means get destinatio... 72.0 12.0 7.000000 0.0
12 positive Perfect transport 17.0 2.0 8.000000 0.0
13 negative rider vey wicked use add money 58.0 13.0 4.166667 0.0
14 positive easiest way find transport safer 52.0 11.0 5.600000 0.0
15 positive Safe travel 19.0 4.0 5.000000 0.0
16 positive Always good ride good drivers 40.0 8.0 5.000000 0.0
17 positive kids loved spacious ride 33.0 6.0 5.250000 0.0
18 positive enjoyed ride 17.0 4.0 5.500000 0.0
19 None Clean cars 11.0 2.0 4.500000 0.0
20 positive Best service best prices 29.0 5.0 5.250000 0.0
21 positive Fast convenient friendly drivers smile 52.0 8.0 6.800000 0.0
22 positive never encounted bad experience even drivers pr... 75.0 12.0 7.142857 0.0
23 positive convenient way moving around ever 42.0 7.0 5.800000 0.0
24 None Nice rides ☺️ 14.0 3.0 3.666667 0.0

Importing relevant modules

In [ ]:
from sklearn.linear_model import LogisticRegression

Split the data into labeled and unlabeled

In [ ]:
labeled_data = df3[df3['Sentiment'] != 'None']
unlabeled_data = df3[df3['Sentiment'] == 'None']
In [ ]:
unlabeled_data.sample(20)
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word hashtags
3715 None Ive experienced simple quick cheap way send mo... 86.0 16.0 5.666667 0.0
5700 None free love 36.0 10.0 4.000000 0.0
1144 None ask youve today ÃÂÂÂ... 555.0 9.0 131.250000 0.0
3188 None play game internet availible 40.0 7.0 6.250000 0.0
3622 None support team tencent 28.0 5.0 6.000000 0.0
1071 None badI cant install phone 35.0 7.0 5.000000 0.0
976 None Bring seen sign shows messag seen without ente... 94.0 16.0 6.111111 0.0
5201 None nice meet communicate others see friends 70.0 15.0 5.833333 0.0
4172 None instant transfers international banks count do... 131.0 19.0 7.250000 0.0
1180 None new layout fine feature loved gone 66.0 15.0 4.833333 0.0
1524 None ‚°ÃÂ... 16349.0 1.0 16349.000000 0.0
5047 None lot ads like ads play next song 67.0 16.0 3.571429 0.0
1175 None love since update everything 47.0 10.0 6.250000 0.0
1348 None great app tution classes 37.0 8.0 5.250000 0.0
1776 None one look antalya Alanya 52.0 10.0 5.000000 0.0
627 None expensive dont know unsubscribe 55.0 10.0 7.000000 0.0
3745 None weeks transfer arrived promised returne back a... 133.0 26.0 6.000000 0.0
4145 None access search tab 24.0 6.0 5.000000 0.0
102 None good app always useful updates ment ease user ... 130.0 22.0 5.800000 0.0
1402 None Dated nonmaterial design UI lacks basic functi... 99.0 15.0 6.363636 0.0
In [ ]:
labeled_data.sample(20)
Out[ ]:
Sentiment Text Character_Length Word_Length avg_word hashtags
2963 positive like fact took highlights Facebook 58.0 12.0 6.000000 0.0
5651 positive love use also phone time working computer 93.0 22.0 5.000000 0.0
5320 positive never used kind app im download app fun classe... 96.0 18.0 5.090909 0.0
1795 positive previous version fits 26.0 4.0 6.333333 0.0
2397 neutral get score coins 33.0 10.0 4.333333 0.0
3998 negative media section select image option see chat lik... 98.0 20.0 5.333333 0.0
5448 positive far enjoying bolt rides 43.0 9.0 5.000000 0.0
5489 positive beauty journey lies drivers 46.0 9.0 6.000000 0.0
4759 positive pretty good game 👌👌 tho could use less b... 405.0 82.0 5.325000 0.0
4775 negative Worst game ever game totally force spend money 58.0 11.0 4.875000 0.0
411 positive worried easy every way Ive tried transfer mone... 124.0 21.0 6.272727 0.0
5696 positive Surely best flexible time tracking apps 60.0 11.0 5.666667 0.0
5658 positive point use app 25.0 6.0 3.666667 0.0
5880 negative new update completely wiped LIKED PLAYLISTdo s... 98.0 15.0 6.400000 0.0
3431 negative PREMIUM since since problem downloaded songs d... 155.0 30.0 6.300000 0.0
1683 positive good lesson app designers 58.0 13.0 5.500000 0.0
1651 positive Excellent content could much better designed i... 67.0 10.0 6.857143 0.0
3766 neutral steps needed send money wallet 37.0 7.0 5.200000 0.0
4909 positive good app although Easy fast love gave three star 71.0 17.0 4.444444 0.0
1667 negative Pubg Mobile Good Game Butt hacker high fast gr... 114.0 20.0 5.000000 0.0

Preprocess and extract features

In [ ]:
labeled_data['Sentiment']
Out[ ]:
0       positive
1       positive
2       positive
4       positive
8       positive
          ...   
5919    positive
5920    negative
5921    negative
578     negative
1837    negative
Name: Sentiment, Length: 4242, dtype: object
In [ ]:
sentiment_labelled = []
for index, row in labeled_data.iterrows():
    sentiment_labelled.append(row['Sentiment'])
In [ ]:
X = input_vector.transform(labeled_data['Text']).toarray()
y = sentiment_labelled
In [ ]:
import sklearn.linear_model as sk
model = sk.LogisticRegression(random_state=0, solver='liblinear', multi_class='ovr').fit(X, y)
model_score = model.score(X, y)
model_score
Out[ ]:
0.8781235266383781

Creating a seperate file to compare to df3 later on

In [ ]:
df4 = df3

Predict labels for the unlabeled data

In [ ]:
X_unlabeled = input_vector.transform(unlabeled_data['Text']).toarray()
predicted_labels = model.predict(X_unlabeled)

Here we created a series of columns for training on later, I need to fix these laels

In [ ]:
df4['original'] = df3['Sentiment']
df4.loc[df4['Sentiment'] == 'None', 'Sentiment'] = predicted_labels
df4['predicted'] = df4[['Sentiment']]
df4['Sentiment'] = df3[['Sentiment']]

Creating a final datafile with the correct labels

In [ ]:
df5 = df4[['Sentiment', 'original', 'predicted']]
df5
Out[ ]:
Sentiment original predicted
0 positive positive positive
1 positive positive positive
2 positive positive positive
3 negative None negative
4 positive positive positive
... ... ... ...
5919 positive positive positive
5920 negative negative negative
5921 negative negative negative
578 negative negative negative
1837 negative negative negative

5821 rows × 3 columns

Change all occurances of words into a score

Function that turns a dataframe into numerical values

In [ ]:
def ConvertSentiment(score):
  if score == 'positive':
    return 1
  elif score == 'neutral':
    return 0
  else:
    return 2
  return score

Applying a score for each sentiment

In [ ]:
df5['Sentiment'] = df5['Sentiment'].apply(ConvertSentiment)
df5['predicted']  = df5['predicted'].apply(ConvertSentiment)
In [ ]:
df5.head(60)
Out[ ]:
Sentiment original predicted
0 1 positive 1
1 1 positive 1
2 1 positive 1
3 2 None 2
4 1 positive 1
5 1 None 1
6 1 None 1
7 1 None 1
8 1 positive 1
9 1 positive 1
10 1 positive 1
11 1 positive 1
12 1 positive 1
13 2 negative 2
14 1 positive 1
15 1 positive 1
16 1 positive 1
17 1 positive 1
18 1 positive 1
19 1 None 1
20 1 positive 1
21 1 positive 1
22 1 positive 1
23 1 positive 1
24 1 None 1
25 1 positive 1
26 1 None 1
27 1 None 1
28 1 positive 1
29 1 positive 1
30 1 positive 1
31 1 None 1
32 1 None 1
33 1 positive 1
34 1 positive 1
35 1 positive 1
36 1 None 1
37 1 positive 1
38 2 negative 2
39 2 negative 2
40 1 None 1
41 1 positive 1
42 2 negative 2
43 2 negative 2
44 2 negative 2
45 0 None 0
46 2 None 2
47 2 negative 2
48 2 None 2
49 2 None 2
50 2 negative 2
51 2 negative 2
52 2 negative 2
53 2 negative 2
54 2 negative 2
55 2 negative 2
56 2 negative 2
57 2 negative 2
58 2 None 2
59 2 negative 2

Loaded and caluculated the error for the model

In [ ]:
from sklearn.metrics import mean_squared_error

predicted_values = df5['predicted']
actual_values = df5['Sentiment']

mse = mean_squared_error(actual_values, predicted_values)
rmse = mse ** 0.5

print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Squared Error (MSE): 0.0
Root Mean Squared Error (RMSE): 0.0

Save Model¶

We save the model. We will use this saved model to build our web application

In [ ]:
import joblib
In [ ]:
joblib.dump(input_vector, 'vector.pkl')
Out[ ]:
['vector.pkl']
In [ ]:
joblib.dump(model, 'senitment_analysis.pkl')
Out[ ]:
['senitment_analysis.pkl']

Test Model¶

Let's quickly test our modal to see it predicts based on our analysis

In [ ]:
test_model = joblib.load('senitment_analysis.pkl')
In [ ]:
result = test_model.score(X, y)
print(result)
0.8781235266383781

Conclusion¶

Hence, we can conclude that the Logistic Regression is the best model to use for our dataset. The Random forest model was a close second. In the end, we decided to go with Logistic Regression due to it's higher accuracy

Bibliography¶

Mehreen Saeed, Modeling Pipeline Optimization With scikit-learn URL - https://machinelearningmastery.com/modeling-pipeline-optimization-with-scikit-learn/

Pratik Parmar, Enable plotfly in a cell in colab URL - https://stackoverflow.com/a/54771665

Build a function to get dicitonaries -URL - https://stackoverflow.com/questions/8653516/search-a-list-of-dictionaries-in-python

Gilbert Tanner, Building web app with streamlit and deploying wit Heroku - URL - https://gilberttanner.com/blog/deploying-your-streamlit-dashboard-with-heroku/

M.A. Al-Barrak,Muna S. Al-Razgan, Predicting students’ performance through classification: Journal of Theoretical and Applied Information Technology 75(2):167-175 URL - https://www.researchgate.net/publication/282381796_Predicting_students'_performance_through_classification_A_case_study

Note from the Author¶

This file was generated using The NBConvert, additional information on how to prepare articles for submission is here.

The article itself is an executable colab Markdown file that could be downloaded from Github with all the necessary artifacts.

Link to the web application - Sentiment Analysis

Kunwar Rajdeep Singh - York University School of Continuing Studies